We introduce and formulate two types of random-walk domination
problems in graphs, motivated by a number of practical applications
(e.g., the item-placement problem in online social networks, the
Ads-placement problem in advertisement networks, and the
resource-placement problem in P2P networks). Specifically, given a graph
G, the goal of the first type of random-walk domination problem is
to target k nodes such that the total hitting time of an L-length
random walk starting from the remaining nodes to the targeted nodes is
minimal. The second type of random-walk domination problem is
to find k nodes that maximize the expected number of nodes that hit
any one targeted node through an L-length random walk. We prove
that these problems are two special instances of the submodular
set function maximization with cardinality constraint problem. To
solve them effectively, we propose a dynamic-programming (DP)
based greedy algorithm with a near-optimal performance
guarantee. The DP-based greedy algorithm, however, is not very
efficient due to the expensive marginal gain evaluation. To further
speed up the algorithm, we propose an approximate greedy
algorithm with linear time complexity w.r.t. the graph size and also
with a near-optimal performance guarantee. The approximate greedy
algorithm is based on carefully designed random-walk sampling
and sample-materialization techniques. Extensive experiments
demonstrate the effectiveness, efficiency, and scalability of the proposed
algorithms.

1. INTRODUCTION 

Given a graph $G = (V, E)$ with $n = |V|$ nodes and $m = |E|$ edges,
how can we quickly target k nodes such that the targeted nodes can
be easily reached by the remaining nodes through an L-length random
walk, i.e., a random walk that moves at most L hops? And how can
we rapidly find k nodes so as to maximize the expected number
of nodes that hit any one targeted node by the L-length random
walk? We refer to these two problems as two types of random-walk
domination problems, because a node that hits any one targeted node
can be regarded as being dominated by the targeted nodes through
an L-length random walk. Intuitively, the random-walk domination
problems are very hard because there are $\binom{n}{k}$ possible solutions,
and for each solution one should perform $n - k$ calculations to
check whether or not each remaining node reaches any targeted node
by the L-length random walk (or to record its hitting time). These problems
are encountered in many data mining and social network analysis
applications. Some of them are discussed as follows.

1.1 Motivation 

Item-placement problem in online social networks: Recently,
social networking services have become an important medium for
users to search for information online [17, 16, 26, 31, 10]. In many
online social networks, users find information primarily through a
social process called social browsing [17, 16]. In particular, social
browsing is the process by which the users in a social network find
information along their social ties [17, 16]. For example, on the online
photo-sharing website Flickr (http://www.flickr.com/), a
user can view his friends' photos by visiting their home-pages. Once
the user arrives at one of his friends' home-pages, he can browse
the photos created by his friend's friends in the same way.
Clearly, the next home-page that a user visits depends only
on the home-page where the user currently stays. Therefore, a user's
social browsing process can be regarded as a random-walk process
on the social network. Furthermore, users typically have an implicit
time limit for browsing others' home-pages because users cannot
browse an infinite number of home-pages. As a result, we can model
the social browsing process as an L-length random walk by assuming
that each user visits at most L home-pages in a social browsing
process.

Based on the social browsing process, two interesting questions
are: (1) how to place items (e.g., news, photos, videos, and
applications) on a small fraction of users in a social network so that
the other users can easily discover such items via social browsing,
and (2) how to place items on a small fraction of users so that as
many users as possible can find such items by social browsing.
Let us consider a more concrete application in the Facebook social
network. Assume that an application developer wants to popularize
his Facebook application. Then, he may select a small fraction of
users, say k users, to install his application for free. Note that in
Facebook, if a user has installed an application, then his friends can
view such an application by browsing his home-page (social browsing).
Therefore, the question is how to select k users so that
the other users can easily find such an application (or so that as many
users as possible can find it), which is equivalent to
question (1) (respectively, question (2)). Since we model the social browsing
process as an L-length random walk, these questions are actually
two instances of the random-walk domination problems.

Optimizing Ads-placement in advertisement networks: A similar
example is encountered in online advertisement networks,
where an advertisement developer would like to place an advertisement
(Ad) on a small fraction of users (he may pay these users)
such that it can be easily found by other users via social browsing
(or such that as many users as possible can find the Ad by social
browsing). Likewise, we can model the user information-finding process
in advertisement networks as an L-length random walk. As a
consequence, these problems become two instances of the
random-walk domination problems.

Accelerating resource search in P2P networks: The study of
the random-walk domination problems could also help
accelerate resource search in P2P networks. Specifically, in a P2P
network, the question is how to place resources on a small number of
peers such that other peers can easily search for such resources via
some pre-specified search strategies. In P2P networks, a commonly-used
search strategy is based on random walk [5]. Moreover, a
resource-search process in P2P networks typically has a lifespan. That is
to say, the resource-search process generally has a time or hop
limit. Therefore, we can also model the resource-search process in a
P2P network as an L-length random walk, i.e., the resource-search
process searches at most L peers in its lifespan. Clearly, based
on the L-length random walk, the resource-placement problem in
P2P networks is an instance of the random-walk domination problem.
Therefore, the results on the random-walk domination
problems can be used to accelerate resource search in P2P networks.

1.2 Our main contributions 

This paper presents the first study on the random-walk domination
problems. Our goal is to formulate the random-walk domination
problems and devise efficient and effective algorithms for
these problems that can be directly applied to all the above
applications. In particular, we first formulate the two types of random-walk
domination problems described above as two discrete optimization
problems. Then, we prove that these two problems are
instances of the submodular set function maximization with
cardinality constraint problem [27]. In general, such problems are
NP-hard [27]. Therefore, we resort to approximation algorithms
to solve them efficiently. To this end, we devise a dynamic
programming (DP) based greedy algorithm to solve these problems
effectively. By a well-known result [27], the DP-based greedy
algorithm achieves a $1 - 1/e$ ($\approx 0.63$) approximation factor.
However, the time complexity of the DP-based greedy algorithm is more than
cubic w.r.t. the network size, so it only works well on small
graphs. To overcome this drawback, we develop an approximate
greedy algorithm based on carefully designed random-walk
sampling and sample-materialization techniques. The time and space
complexity of the approximate greedy algorithm are linear w.r.t.
the graph size, so it scales to large graphs.
Moreover, we show that the approximate greedy algorithm
achieves a $1 - 1/e - \epsilon$ approximation factor, where $\epsilon$ is a very
small constant. Finally, we conduct comprehensive experiments
over both synthetic and real-world graph datasets. The results
indicate that the approximate greedy algorithm achieves performance
very similar to that of the DP-based greedy algorithm, and that it substantially
outperforms the other baselines. In addition, the results
demonstrate that the approximate greedy algorithm scales linearly w.r.t.
the graph size.

The rest of this paper is organized as follows. Below, we
briefly review the existing studies that are related to ours. After
that, we formulate the random-walk domination problems in
Section 2. We propose the DP-based greedy algorithm and the
approximate greedy algorithm for solving the random-walk domination
problems in Section 3. Extensive experiments are reported in
Section 4. We conclude this work in Section 5.



1.3 Related work 

Our problems are closely related to the dominating set problem
in graphs. The dominating set problem is a classic NP-hard problem
that has been well studied in the literature [8, 7]. There is
an $O(\log n)$-approximation algorithm for solving this problem
efficiently [7]. Moreover, it has been shown that such an approximation
factor is optimal unless P=NP [7, 4]. The dominating set problem
has been widely studied in the networking community due to
a large number of applications in wireless sensor networks [34, 32,
24] and other ad hoc networks [33, 2]. Recently, many different
extensions of the dominating set problem have also been investigated.
Notable examples include the distributed dominating set
problem [15], the connected dominating set problem [28, 5, 32,
33], the Steiner connected dominating set problem [5], and the
k-dominating set problem [7, 34]. All of these extensions are based
on the traditional definition of domination [8], where the nodes
deterministically dominate their immediate (or L-hop) neighbors. In
our work, the problems are based on a newly defined concept called
random-walk domination, in which the targeted nodes dominate an
L-hop neighbor if and only if such a neighbor-node hits one of the
targeted nodes through an L-length random walk.

Our work is also related to the submodular set function
maximization problem [27]. In general, the problem of submodular
function maximization subject to a cardinality constraint is NP-hard.
Nemhauser et al. [27] propose a greedy algorithm with a $1 - 1/e$
approximation factor for this problem. Recently, many applications
have been formulated as the submodular set function maximization
subject to cardinality constraint problem. Some notable examples
include the classic maximal k coverage problem [4], the influence
maximization problem in social networks [11], the outbreak detection
problem in networks [19], the observation selection and sensor
placement problem [12, 14], the document summarization problem
[22, 23], the privacy preserving data publishing problem [13], the
diversified ranking problem [20, 21], and the filter-placement problem
[3]. All of these problems can be approximately solved by the
greedy algorithm given in [27]. Here we study two random-walk
domination problems in graphs, and we show that both of them
can also be formulated as the submodular set function maximization
with cardinality constraint problem. Also, we present a
near-optimal approximate greedy algorithm to solve them efficiently.

2. PROBLEMS STATEMENT 

Consider an undirected and unweighted graph $G = (V, E)$,
where V denotes a set of nodes and E denotes a set of undirected
edges. Let $n = |V|$ and $m = |E|$ be the number of nodes and
the number of edges in G respectively. Although we only focus
on undirected and unweighted graphs in this paper, the proposed
techniques can also be easily extended to directed and weighted
graphs. Below, we first introduce some important concepts about
random walks on graphs, and then we formulate two different types
of random-walk domination problems.

A random walk on an undirected and unweighted graph denotes
the following process. Given an undirected and unweighted graph
G and a starting node u, the random walk picks a neighbor-node of
u uniformly at random and moves to this neighbor-node, and then
repeats this step recursively [25]. In this work, we consider a
general random walk model called the L-length random walk, where
the path-length of the random walk is bounded by L [29]. It is
important to note that the traditional random walk is a special case
of the L-length random walk obtained by setting the parameter L to infinity.
Moreover, as discussed in Section 1, many practical applications
should be modeled by the L-length random walk. Let us consider
the graph shown in Fig. 1. Assume that $L = 4$. Then, two possible
paths generated by an L-length random walk starting from $v_1$ are
$(v_1, v_2, v_3, v_2, v_6)$ and $(v_1, v_6, v_2, v_3, v_5)$, both of which
have length 4. Notice that nodes can be visited repeatedly
by the L-length random walk. For instance, in the first path, $v_2$ is
visited twice by the L-length random walk.

Figure 1: Running example.
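
As an illustration, the L-length random walk described above can be simulated in a few lines of Python. This is only a minimal sketch under assumptions that are not part of the paper: the graph is stored as a dictionary of adjacency lists, the function name `l_length_walk` is hypothetical, and the toy graph below is not the graph of Fig. 1 (which is not reproduced here).

```python
import random


def l_length_walk(adj, start, L, rng=random):
    """Return the node sequence of one L-length random walk from `start`.

    `adj` maps each node to the list of its neighbors (assumed format).
    """
    path = [start]
    for _ in range(L):
        neighbors = adj[path[-1]]
        if not neighbors:  # isolated node: the walk cannot move
            break
        # Move to a neighbor chosen uniformly at random.
        path.append(rng.choice(neighbors))
    return path


# Hypothetical toy graph (NOT the graph of Fig. 1).
adj = {1: [2, 6], 2: [1, 3, 6], 3: [2, 5], 5: [3], 6: [1, 2]}
walk = l_length_walk(adj, start=1, L=4, rng=random.Random(0))
print(walk)  # a list of L + 1 = 5 nodes; nodes may repeat
```

Passing a seeded `random.Random` instance makes the sampled path reproducible, which is convenient when walks are later reused as samples.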

Next, we define an important concept called hitting time for the
L-length random walk. In particular, the hitting time between a
source node and a targeted node measures the expected number of hops
taken by an L-length random walk that starts at the source node
and arrives at the targeted node for the first time. Formally, denote by
$Z_u^t$ the position of an L-length random walk starting from node u
at discrete time t. Let $T_{uv}^L$ be a random variable defined as

$$T_{uv}^L = \min\{\min\{t : Z_u^t = v,\ t \ge 0\},\ L\}. \tag{1}$$

Then, the hitting time between nodes u and v, denoted by $h_{uv}^L$, is
defined as the expectation of $T_{uv}^L$, i.e., $h_{uv}^L = \mathbb{E}[T_{uv}^L]$. By this
definition, the following lemma immediately holds.

Lemma 2.1: For any two nodes u and v, the hitting time $h_{uv}^L$ is
bounded by L, i.e., $h_{uv}^L = \mathbb{E}[T_{uv}^L] \le L$.



The following theorem shows that the exact hitting time between 
two nodes can be computed recursively. 

Theorem 2.1: Let $d_u$ be the degree of node u and $p_{uw} = 1/d_u$ be
the transition probability from u to a neighbor-node w (with $p_{uw} = 0$
if w is not a neighbor of u). Then, for any nodes u and v, $h_{uv}^L$ can be
recursively computed by

$$h_{uv}^L = \begin{cases} 0, & u = v \\ 1 + \sum_{w \in V} p_{uw} h_{wv}^{L-1}, & u \neq v \end{cases} \tag{2}$$

where $h_{wv}^{L-1}$ denotes the hitting time between w and v based on an
$(L-1)$-length random walk.

Proof: See Appendix. □
We remark that in [29], Sarkar and Moore define the hitting time
of the L-length random walk in a recursive manner, as given
in Eq. (2). Our definition is more intuitive than theirs
because it is based on Eq. (1), in which the hitting
time is explicitly bounded by L. In the above theorem, we show
that our definition of hitting time can be computed by the same
recursive equation (Eq. (2)) as defined in [29]. Furthermore, based
on Eq. (1), it is very easy to design a sampling-based algorithm to
estimate the hitting time. We will illustrate this point in Section 3.

2.1 The random-walk domination problems

Based on the L-length random walk model, we introduce two
types of random-walk domination problems in graphs. We first
describe the first type of random-walk domination problem.
Denote by $S \subseteq V$ a subset of nodes. Assume that there is
an L-length random walk starting from a node $u \in V$. If such a
random walk reaches any node in S at any discrete time in $[0, L]$,
we say that u hits S, or S dominates u, by an L-length random
walk. For example, consider the graph shown in Fig. 1. Suppose
that $S = \{v_5, v_6\}$ and $L = 4$. There is an L-length random
walk $(v_1, v_2, v_3, v_2, v_6)$ starting from $v_1$. Since this random walk
reaches node $v_6$ and $v_6 \in S$, we say that $v_1$ hits S, or S dominates
$v_1$. Clearly, if $u \in S$, then u hits S. Next, we define another
important concept called generalized hitting time, which measures the
hitting time from a single source node to a set of targeted nodes S.
Specifically, let $T_{uS}^L$ be a random variable defined as

$$T_{uS}^L = \min\{\min\{t : Z_u^t \in S,\ t \ge 0\},\ L\}. \tag{3}$$

By this definition, $T_{uS}^L$ denotes the number of hops after which the
L-length random walk starting at u hits any node in S for the first
time. Reconsider the example in Fig. 1. Suppose that $u = v_1$,
$S = \{v_5, v_6\}$, and there is a 4-length random walk
$(v_1, v_2, v_3, v_2, v_6)$ starting at
$v_1$. Then, $T_{uS}^L = 4$ because the L-length random walk starting at
$u = v_1$ hits a node $v_6 \in S$ at time 4 for the first time. Note that if
$S = \emptyset$, we have $T_{uS}^L = L$. This is because if S is an empty set, then
u cannot hit S, and thereby $\min\{t : Z_u^t \in S,\ t \ge 0\}$ is infinity.
In addition, if $L = 0$, then $T_{uS}^L = 0$, as
$\min\{t : Z_u^t \in S,\ t \ge 0\} \ge 0$. Based on $T_{uS}^L$, the generalized hitting time from u to S,
denoted by $h_{uS}^L$, is defined as the expectation of $T_{uS}^L$, i.e.,
$h_{uS}^L = \mathbb{E}[T_{uS}^L]$. By this definition, a smaller $h_{uS}^L$ suggests that the node u
hits a node in S more easily through an L-length random walk.
Similarly, the generalized hitting time can be computed according
to the following theorem.

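Theorem 2.2: For any node u and set S, $h_{uS}^L$ can be computed by

$$h_{uS}^L = \begin{cases} 0, & u \in S \\ 1 + \sum_{w \in V} p_{uw} h_{wS}^{L-1}, & u \notin S. \end{cases} \tag{4}$$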

Proof: The proof is very similar to the proof of Theorem 2.1; thus
we omit it for brevity. □
Note that for $L = 0$, we have $h_{uS}^0 = 0$, as $T_{uS}^0 = 0$. Based on
the generalized hitting time, the first type of random-walk domination
problem is to minimize the sum of the generalized hitting times
from the nodes in $V \setminus S$ to the targeted set of nodes S subject to
$|S| \le k$. More formally, this problem is formulated as

$$\min \sum_{u \in V \setminus S} h_{uS}^L \quad \text{s.t.}\ |S| \le k. \tag{5}$$

It is easy to verify that the above optimization problem is equivalent
to the following one, because $nL$ is a constant that does not depend
on S. For convenience, in the rest of this paper,
we refer to the following problem as the first type of random-walk
domination problem and denote it by Problem (1).

Problem (1):

$$\max\ nL - \sum_{u \in V \setminus S} h_{uS}^L \quad \text{s.t.}\ |S| \le k. \tag{6}$$
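
The recursion of Eq. (4), together with the base case $h_{uS}^0 = 0$, can be evaluated by a simple dynamic program over the walk length. The following Python sketch illustrates this and then evaluates the objective of Problem (1). It is not the authors' implementation: the adjacency-list dictionary, the function names, and the toy path graph are assumptions made for illustration, and every node is assumed to have at least one neighbor.

```python
def generalized_hitting_times(adj, S, L):
    """Return {u: h_{uS}^L} using the recursion of Eq. (4).

    Assumes every node has at least one neighbor, so that
    p_uw = 1/d_u is well defined.
    """
    S = set(S)
    h = {u: 0.0 for u in adj}  # base case: h_{uS}^0 = 0 for every u
    for _ in range(L):
        # One application of Eq. (4): values for walk length l - 1
        # are turned into values for walk length l.
        h = {
            u: 0.0 if u in S
            else 1.0 + sum(h[w] for w in adj[u]) / len(adj[u])
            for u in adj
        }
    return h


def objective_f1(adj, S, L):
    """Objective of Problem (1): nL minus the total generalized hitting time."""
    S = set(S)
    h = generalized_hitting_times(adj, S, L)
    return len(adj) * L - sum(h[u] for u in adj if u not in S)


# Hypothetical toy graph: the path 0 - 1 - 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
h = generalized_hitting_times(adj, S={2}, L=2)
print(h, objective_f1(adj, S={2}, L=2))
```

Each of the L rounds touches every edge a constant number of times, so one call costs $O(mL)$ time. Taking $S = \{v\}$ gives the single-target hitting time of Eq. (2) as a special case.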

Second, we formulate the second type of random-walk domination
problem. Let $X_{uS}^L$ be a random variable such that $X_{uS}^L = 1$ if
node u hits S by an L-length random walk, and $X_{uS}^L = 0$ otherwise.
Given a graph G and a constant k, the second type of random-walk
domination problem is to maximize the expected number of nodes
that can be dominated by the set S subject to a cardinality
constraint, i.e., $|S| \le k$. Formally, the problem is defined as

Problem (2):

$$\max\ \mathbb{E}\Big[\sum_{u \in V} X_{uS}^L\Big] \quad \text{s.t.}\ |S| \le k. \tag{7}$$

Let $p_{uS}^L$ be the probability of the event that an L-length random
walk starting from node u successfully hits a node in S. Then,
we have $\mathbb{E}[X_{uS}^L] = p_{uS}^L$. Moreover, by definition, we have the
following theorem.

Theorem 2.3: For $L > 0$, we have

$$p_{uS}^L = \begin{cases} 1, & u \in S \\ \sum_{w \in V} p_{uw}\, p_{wS}^{L-1}, & u \notin S. \end{cases} \tag{8}$$

Proof: The result follows directly from the definition; we therefore
omit the proof for brevity. □
For $L = 0$, we define $p_{uS}^0 = 1$ if $u \in S$, and $p_{uS}^0 = 0$ otherwise.
The rationale is that a 0-length random walk means that a node
does not walk to any other node, so u hits S if and only if $u \in S$.
It is important to emphasize that
Problem (2) is different from Problem (1): Problem (2)
is to maximize the expected number of nodes that hit the targeted
set by the L-length random walk, while Problem (1) is to minimize
the total expected time (or the expected number of hops) that
the nodes take to hit the targeted set.
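
The hit probability $p_{uS}^L$ of Theorem 2.3 can be computed by the same kind of dynamic program, starting from the $L = 0$ base case. The sketch below is illustrative only: the adjacency-list dictionary, the function name, and the toy path graph are assumptions, and every node is assumed to have at least one neighbor.

```python
def hit_probabilities(adj, S, L):
    """Return {u: p_{uS}^L}, the probability that an L-length walk from u hits S."""
    S = set(S)
    # Base case (L = 0): u hits S if and only if u is already in S.
    p = {u: 1.0 if u in S else 0.0 for u in adj}
    for _ in range(L):
        # One application of Eq. (8) with p_uw = 1/d_u for each neighbor w.
        p = {
            u: 1.0 if u in S
            else sum(p[w] for w in adj[u]) / len(adj[u])
            for u in adj
        }
    return p


# Hypothetical toy graph: the path 0 - 1 - 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
p = hit_probabilities(adj, S={2}, L=2)
f2 = sum(p.values())  # objective of Problem (2): expected number of dominated nodes
print(p, f2)
```

On this toy graph the two objectives are visibly different quantities: the objective of Problem (2) sums hit probabilities over all nodes, whereas the objective of Problem (1) is driven by expected hop counts.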

Distinguishing Problem (2) from the influence maximization
problem: The influence maximization problem in social networks
is to select k nodes to maximize the expected influence spread
from those k nodes based on an influence spread model [11]. A
commonly-used influence spread model is the independent cascade
model [11], where a user influences his friends with a pre-specified
probability and the influence spread along an edge is independent
of the influence spread over the other edges. More specifically,
under the independent cascade model, the social network is modeled
by a probabilistic graph, where each edge is associated with a
probability and all of those probabilities are independent of one another.
The influence maximization problem is to select k nodes to
maximize the expected number of nodes that are reachable from the
selected nodes. Recall that Problem (2) is to select k nodes to
maximize the expected number of nodes that can reach a node in the
targeted node set following an L-length random walk. Although these
two problems are seemingly similar, Problem (2) differs from the
influence maximization problem in three respects.
First, Problem (2) is based on an L-length random walk
model, which is a Markov-chain model where the visiting
probability of a node depends on the visiting probability of its immediate
neighbors. The influence maximization problem, however, is based
on the independent cascade model, where the probabilities associated
with the edges are independent of one another. Second, in the
influence maximization problem, a targeted node could influence
multiple immediate neighbors at a discrete time. However, in an
L-length random walk model, each node only follows one immediate
neighbor. For example, in Fig. 1, assume that there is a 4-length
random walk $(v_1, v_2, v_3, v_2, v_6)$ starting from $v_1$. Suppose that in
the independent cascade model, the node $v_1$ has successfully
influenced nodes $v_2$ and $v_3$. Clearly, in this case, $v_1$ has only one
descendant node in the L-length random walk model, while in the
independent cascade model $v_1$ has two. Finally, the influence
maximization problem relies on predefined influence probabilities,
all of which are input parameters. In
Problem (2), we do not require the knowledge of influence probabilities.
The only input parameters of our problems are the graph
topology and the parameter k.

3. THE ALGORITHMS 

The goal of this section is to present algorithmic treatments for
Problem (1) and Problem (2). Specifically, we first prove that both
Problem (1) and Problem (2) are instances of the submodular
set function maximization with cardinality constraint problem [27].
In general, these problems are NP-hard [27]. Therefore, we strive
to devise approximation algorithms for these problems. In the following,
we present two efficient greedy algorithms for Problem (1)
and Problem (2) with near-optimal performance guarantees.

3.1 Submodularity and greedy algorithm 



Algorithm 1 The greedy algorithm

Input: A graph $G = (V, E)$, and a parameter k
Output: A set of nodes S

1: $S \leftarrow \emptyset$;
2: for i = 1 to k do
3:     $v \leftarrow \arg\max_{u \in V \setminus S} \{F(S \cup \{u\}) - F(S)\}$;
4:     $S \leftarrow S \cup \{v\}$;
5: return S;



Before we proceed, let us give a definition of the nondecreasing
submodular set function [27].

Definition 3.1: Let V be a finite set. A real-valued function $f(S)$
defined on the subsets of V, i.e., $S \subseteq V$, is called a nondecreasing
submodular set function if the following conditions hold.

• Nondecreasing: For any subsets S and T of V such that
$S \subseteq T \subseteq V$, we have $f(S) \le f(T)$.

• Submodularity: Let $\sigma_j(S) = f(S \cup \{j\}) - f(S)$ be the
marginal gain. Then, for any subsets S and T of V such that
$S \subseteq T \subseteq V$ and $j \in V \setminus T$, we have $\sigma_j(S) \ge \sigma_j(T)$.
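
For small ground sets, the two conditions of Definition 3.1 can be checked by brute force, which is a useful sanity test before relying on the greedy guarantee. The Python sketch below is only an illustration: the checker `is_nondecreasing_submodular` and the set-coverage example are hypothetical and not taken from the paper, and the enumeration is exponential in $|V|$, so it is suitable for toy inputs only.

```python
from itertools import combinations


def is_nondecreasing_submodular(f, ground):
    """Brute-force check of Definition 3.1 on every pair S subset-of T."""
    ground = list(ground)
    subsets = [frozenset(c)
               for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for S in subsets:
        for T in subsets:
            if not S <= T:
                continue
            if f(S) > f(T):  # violates the nondecreasing condition
                return False
            for j in ground:
                if j in T:
                    continue
                gain_S = f(S | {j}) - f(S)
                gain_T = f(T | {j}) - f(T)
                if gain_S < gain_T:  # violates diminishing marginal gains
                    return False
    return True


# Hypothetical example: set coverage, a classic submodular function.
sets = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}


def coverage(S):
    return len(set().union(*(sets[j] for j in S)))


print(is_nondecreasing_submodular(coverage, sets))
print(is_nondecreasing_submodular(lambda S: len(S) ** 2, sets))
```

The second call uses $f(S) = |S|^2$, which is nondecreasing but has increasing marginal gains, so it fails the submodularity condition.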

Then, based on the definition of submodular function, we show
that the objective functions of Problem (1) and Problem (2) are
submodular. Specifically, let $F_1(S) = nL - \sum_{u \in V \setminus S} h_{uS}^L$, and let
$F_2(S) = \mathbb{E}[\sum_{u \in V} X_{uS}^L]$. Then, we have the following two
theorems.

Theorem 3.1: $F_1(S)$ is a nondecreasing submodular set function
with $F_1(\emptyset) = 0$.

Proof: See Appendix. □

Theorem 3.2: $F_2(S)$ is a nondecreasing submodular set function
with $F_2(\emptyset) = 0$.

Proof: See Appendix. □
Based on the submodularity of $F_1$ and $F_2$, we present a greedy
algorithm for both Problem (1) and Problem (2), depicted in
Algorithm 1. The greedy algorithm works in k rounds (line 2-4). In
each round, the algorithm selects a node with maximal marginal
gain (line 3), and adds it into the answer set S (line 4), which is
initialized as an empty set (line 1). Note that to solve Problem (1)
and Problem (2), we need to replace the function F in Algorithm 1
with $F_1$ and $F_2$ respectively. By a celebrated result in [27],
Algorithm 1 achieves a $(1 - 1/e)$ approximation factor for Problem (1)
and Problem (2), where $e \approx 2.718$ denotes Euler's number.
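
To make the greedy procedure concrete, the following Python sketch instantiates Algorithm 1 with $F = F_1$, where $F_1$ is evaluated exactly through the recursion of Eq. (4). The data layout (adjacency-list dictionary), the helper names, and the toy path graph are assumptions for illustration, and every node is assumed to have at least one neighbor.

```python
def generalized_hitting_times(adj, S, L):
    """{u: h_{uS}^L} via the recursion of Eq. (4); base case h^0 = 0."""
    h = {u: 0.0 for u in adj}
    for _ in range(L):
        h = {u: 0.0 if u in S
             else 1.0 + sum(h[w] for w in adj[u]) / len(adj[u])
             for u in adj}
    return h


def make_f1(adj, L):
    """Build F_1(S) = nL - (sum of h_{uS}^L over nodes u outside S)."""
    def f1(S):
        h = generalized_hitting_times(adj, S, L)
        return len(adj) * L - sum(h[u] for u in adj if u not in S)
    return f1


def greedy(nodes, F, k):
    """Algorithm 1: in each round add the node with the largest marginal gain."""
    S = set()
    for _ in range(k):
        candidates = [u for u in nodes if u not in S]
        if not candidates:
            break
        base = F(S)
        best = max(candidates, key=lambda u: F(S | {u}) - base)
        S.add(best)
    return S


# Hypothetical toy graph: the path 0 - 1 - 2 - 3 - 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
chosen = greedy(list(adj), make_f1(adj, L=2), k=2)
print(chosen)
```

Ties in the marginal gain are broken by node order here, which is an arbitrary choice. Swapping `make_f1` for a function that evaluates $F_2$ gives the greedy algorithm for Problem (2).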

Complexity analysis: The time complexity of Algorithm 1 is dominated
by the time complexity for computing the marginal gain
(line 3). Below, we focus on an analysis of the greedy algorithm for
Problem (1); a similar analysis applies to Problem (2). For
$F_1$, let $\sigma_u(S) = F_1(S \cup \{u\}) - F_1(S)$ be the marginal gain. Then,
$\sigma_u(S)$ can be calculated based on Eq. (4). Note that Eq. (4)
immediately implies a dynamic programming algorithm for computing
$h_{uS}^L$. Given a set S, the time complexity for computing $h_{uS}^L$ is
$O(mL)$. Therefore, given a set S, the time complexity for calculating
$F_1(S) = \sum_{u \in V \setminus S} (L - h_{uS}^L)$ is $O(nmL)$. Since the greedy
algorithm needs to find the node with maximal marginal gain, it
needs to evaluate $F_1(S \cup \{u\})$ for every node u in $V \setminus S$. As a
result, the time complexity of the greedy algorithm is $O(kn^2 mL)$.
We can use the so-called lazy evaluation strategy [19] to speed up
the greedy algorithm, which could result in several orders of
magnitude speedup as observed in [19]. For the space complexity, the
dynamic programming algorithm needs to maintain an $n \times L$ array
for a given S. To compute the marginal gain, the greedy algorithm
needs to evaluate $F_1(S \cup \{u\})$ for every node u in $V \setminus S$; thus the
space complexity of the greedy algorithm is $O(n^2 L)$. Similarly, for
Problem (2), the time and space complexity of the greedy algorithm
are $O(kn^2 mL)$ and $O(n^2 L)$ respectively.
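
The lazy evaluation strategy is only cited above, not spelled out. The Python sketch below shows one common form of the idea, a priority queue of possibly stale marginal gains; whether it matches the exact variant of [19] is an assumption. It relies on submodularity: a gain computed for an earlier, smaller S is an upper bound on the current gain, so a node whose freshly recomputed gain still tops the queue can be selected without re-evaluating the others. The function name and the coverage example are hypothetical.

```python
import heapq


def lazy_greedy(nodes, F, k):
    """Greedy selection with lazily refreshed marginal gains.

    Heap entries are (-gain, tie_breaker, node, stamp), where `stamp`
    is |S| at the time the gain was computed.
    """
    S = set()
    f_S = F(set())
    heap = [(-(F({u}) - f_S), i, u, 0) for i, u in enumerate(nodes)]
    heapq.heapify(heap)
    for _ in range(k):
        while heap:
            neg_gain, i, u, stamp = heapq.heappop(heap)
            if stamp == len(S):
                # The gain is up to date and is the largest: select u.
                S.add(u)
                f_S -= neg_gain
                break
            # Stale entry: recompute the gain for the current S and reinsert.
            gain = F(S | {u}) - f_S
            heapq.heappush(heap, (-gain, i, u, len(S)))
    return S


# Hypothetical example with a set-coverage objective.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(S):
    return len(set().union(*(sets[j] for j in S)))


picked = lazy_greedy(list(sets), coverage, k=2)
print(picked)
```

In the second round of this example only one gain is recomputed before a node is selected, whereas the plain greedy algorithm would re-evaluate all three remaining candidates.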

Approximate marginal gain computation: Based on the complexity
analysis, the greedy algorithm is clearly impractical for large graphs. The
most time and space consuming step in the greedy algorithm is to
compute the objective functions and the corresponding marginal
gains. Here we present a sampling-based algorithm to approximately
compute the objective functions and the marginal gains
efficiently.

Given a set S, to estimate the objective function $F_1(S)$ ($F_2(S)$),
the key step is to estimate $h_{uS}^L$ ($\mathbb{E}[X_{uS}^L]$). Below, we first
describe an unbiased estimator for $h_{uS}^L$. To construct this
estimator, we independently run R L-length random
walks starting from node u. Assume that r of these
random walks hit some node in S, for the first time
after $t_1, \dots, t_r$ hops respectively. Then, we construct an estimator for $h_{uS}^L$ by

$$\hat{h}_{uS}^L = \frac{\sum_{i=1}^{r} t_i + (R - r)L}{R}. \tag{9}$$

Algorithm 2 Sampling algorithm for estimating $F_1(S)$ and $F_2(S)$

Input: A graph $G = (V, E)$, two parameters L and R, and a set S
Output: Unbiased estimators for $F_1(S)$ and $F_2(S)$

1: $\hat{F}_1(S) \leftarrow 0$;
2: $\hat{F}_2(S) \leftarrow 0$;
3: for each node $u \in V \setminus S$ do
4:     $r \leftarrow 0$;
5:     $t \leftarrow 0$;
6:     for i = 1 : R do
7:         Run an L-length random walk from u;
8:         if the random walk hits any arbitrary node v in S for the first time then
9:             $r \leftarrow r + 1$;
10:            Record $t_i$, the number of hops of the random walk segment from node u to v;
11:            $t \leftarrow t + t_i$;
12:    $\hat{F}_1(S) \leftarrow \hat{F}_1(S) + (t + (R - r)L)/R$;
13:    $\hat{F}_2(S) \leftarrow \hat{F}_2(S) + r/R$;
14: $\hat{F}_1(S) \leftarrow |V \setminus S| \times L - \hat{F}_1(S)$;
15: $\hat{F}_2(S) \leftarrow \hat{F}_2(S) + |S|$;
16: return $\hat{F}_1(S)$ and $\hat{F}_2(S)$;



The following lemma shows that $\hat{h}_{uS}^L$ is an unbiased estimator
of $h_{uS}^L$.

Lemma 3.1: $\hat{h}_{uS}^L$ is an unbiased estimator of $h_{uS}^L$.
Proof: Recall that $h_{uS}^L = \mathbb{E}[T_{uS}^L]$. By Eq. (3), $T_{uS}^L$ denotes the
first time that an L-length random walk starting from u hits any
node in S. If such a random walk cannot hit the nodes in
S, then $T_{uS}^L = L$. To estimate the expectation of $T_{uS}^L$, we
independently run R L-length random walks starting from u, and take the
average hitting time as the estimator. The proposed sampling
process is equivalent to simple random sampling with replacement;
thus the estimator is unbiased. □
Based on $\hat{h}_{uS}^L$ and Lemma 3.1, $\hat{F}_1(S) = \sum_{u \in V \setminus S} (L - \hat{h}_{uS}^L)$
is also an unbiased estimator of $F_1(S)$. Similarly, we can construct
an estimator for $\mathbb{E}[X_{uS}^L]$ by

$$\hat{\mathbb{E}}[X_{uS}^L] = \frac{r}{R}. \tag{10}$$

Also, the estimator $\hat{\mathbb{E}}[X_{uS}^L]$ is unbiased.

Lemma 3.2: $\hat{\mathbb{E}}[X_{uS}^L]$ is an unbiased estimator of $\mathbb{E}[X_{uS}^L]$.
Proof: The result follows directly from the definition; we omit the
proof for brevity. □

Likewise, based on $\hat{\mathbb{E}}[X_{uS}^L]$ and Lemma 3.2,
$\hat{F}_2(S) = \sum_{u \in V} \hat{\mathbb{E}}[X_{uS}^L]$
is an unbiased estimator of $F_2(S)$. We remark that in [30], Sarkar et
al. presented a similar unbiased estimator for the hitting
time of the L-length random walk between two nodes. Here our
estimator $\hat{h}_{uS}^L$ estimates the hitting time of the L-length random
walk between one source node and one targeted set. In this sense,
our estimator is more general than the estimator presented in [30].
Below, we make use of the Hoeffding inequality [9] to bound the
sample size R. Specifically, we have the following two lemmas.

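Lemma 3.3: Given a set S, for two small constants $\epsilon$ and $\delta$, if
$R \ge \frac{1}{2\epsilon^2} \log \frac{n - |S|}{\delta}$, then
$\Pr[|\hat{F}_1(S) - F_1(S)| \ge \epsilon (n - |S|) L] \le \delta$.
Proof: First, we have

$$\Pr\big[|\hat{F}_1(S) - F_1(S)| \ge \epsilon (n - |S|) L\big] \le \Pr\Big[\sum_{u \in V \setminus S} |\hat{h}_{uS}^L - h_{uS}^L| \ge \epsilon (n - |S|) L\Big],$$

because the event $|\hat{F}_1(S) - F_1(S)| \ge \epsilon (n - |S|) L$ implies
the event $\sum_{u \in V \setminus S} |\hat{h}_{uS}^L - h_{uS}^L| \ge \epsilon (n - |S|) L$. Then, by the
union bound, we have

$$\Pr\Big[\sum_{u \in V \setminus S} |\hat{h}_{uS}^L - h_{uS}^L| \ge \epsilon (n - |S|) L\Big] \le \sum_{u \in V \setminus S} \Pr\big[|\hat{h}_{uS}^L - h_{uS}^L| \ge \epsilon L\big].$$

Since $0 \le h_{uS}^L \le L$ (Lemma 2.1), we can apply the Hoeffding
inequality [9] to bound the sample size R. Specifically, we have

$$\Pr\big[|\hat{h}_{uS}^L - h_{uS}^L| \ge \epsilon L\big] \le \exp(-2\epsilon^2 R).$$

Based on this, the following inequality immediately holds:

$$\Pr\big[|\hat{F}_1(S) - F_1(S)| \ge \epsilon (n - |S|) L\big] \le (n - |S|) \exp(-2\epsilon^2 R).$$

Let $(n - |S|) \exp(-2\epsilon^2 R) \le \delta$; then we get
$R \ge \frac{1}{2\epsilon^2} \log \frac{n - |S|}{\delta}$,
which completes the proof. □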

Lemma 3.4: Given a set S, for two small constants $\epsilon$ and $\delta$, if
$R \ge \frac{1}{2\epsilon^2} \log \frac{n}{\delta}$, then $\Pr[|\hat{F}_2(S) - F_2(S)| \ge \epsilon n] \le \delta$.
Proof: The proof is similar to the proof of Lemma 3.3; thus we
omit it for brevity. □
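
As a worked illustration of Lemma 3.3, the bound on R can be evaluated directly. The helper name `walks_per_node` and the parameter values below are hypothetical; the logarithm is the natural logarithm, as in the Hoeffding bound used in the proof.

```python
import math


def walks_per_node(n, s, eps, delta):
    """Smallest integer R with R >= log((n - s) / delta) / (2 * eps^2).

    n is the number of nodes, s = |S|, and eps, delta are the accuracy
    and failure-probability parameters of Lemma 3.3.
    """
    return math.ceil(math.log((n - s) / delta) / (2 * eps ** 2))


# Hypothetical parameter values, chosen only for illustration.
R = walks_per_node(n=1000, s=10, eps=0.1, delta=0.01)
R_large = walks_per_node(n=1_000_000, s=10, eps=0.1, delta=0.01)
print(R, R_large)
```

Because n enters only through a logarithm, growing the graph from a thousand to a million nodes increases the required R by well under a factor of two, whereas halving $\epsilon$ quadruples it.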

Based on the above analysis, in Algorithm 2 we present a sampling-based
algorithm to estimate $F_1(S)$ and $F_2(S)$ given a set S. Note
that the marginal gains $\sigma_u(S) = F_1(S \cup \{u\}) - F_1(S)$ and
$\rho_u(S) = F_2(S \cup \{u\}) - F_2(S)$ can be easily estimated by invoking
Algorithm 2 twice. There are three input parameters L, R, and S in
Algorithm 2, where R is a small value that can be determined
according to Lemma 3.3 and Lemma 3.4. To compute the estimators
of $F_1(S)$ and $F_2(S)$, for each node in $V \setminus S$, Algorithm 2
independently runs R L-length random walks (line 3-15), and records two
quantities r and t (line 9-11). Based on r and t, Algorithm 2 can
easily compute $\hat{F}_1(S)$ and $\hat{F}_2(S)$ (line 12-15). It is worth
mentioning that for a node $u \in S$, we have $\mathbb{E}[X_{uS}^L] = 1$. Therefore, in
line 15, the algorithm adds $|S|$ to $\hat{F}_2(S)$. Finally, the algorithm
outputs the two estimators.
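
The sampling procedure of Algorithm 2 can be sketched in Python as follows. This is an illustrative reading of the pseudocode rather than the authors' code: the adjacency-list dictionary, the function name, the seeded generator, and the toy path graph are assumptions, and every node is assumed to have at least one neighbor.

```python
import random


def estimate_objectives(adj, S, L, R, rng=random):
    """Sampling estimates of F_1(S) and F_2(S), following Algorithm 2."""
    S = set(S)
    sum_hit_time = 0.0  # running sum of the estimates of h_{uS}^L, Eq. (9)
    sum_hit_prob = 0.0  # running sum of the estimates r / R, Eq. (10)
    for u in adj:
        if u in S:
            continue
        r = 0  # number of walks from u that hit S
        t = 0  # total hops of those walks up to their first hit
        for _ in range(R):
            node = u
            for hop in range(1, L + 1):
                node = rng.choice(adj[node])
                if node in S:  # first hit: record it and stop this walk
                    r += 1
                    t += hop
                    break
        # Walks that never hit S contribute the cap L.
        sum_hit_time += (t + (R - r) * L) / R
        sum_hit_prob += r / R
    f1_hat = (len(adj) - len(S)) * L - sum_hit_time  # line 14
    f2_hat = sum_hit_prob + len(S)                   # line 15
    return f1_hat, f2_hat


# Hypothetical toy graph: the path 0 - 1 - 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
f1_hat, f2_hat = estimate_objectives(adj, {2}, L=2, R=20000,
                                     rng=random.Random(1))
print(f1_hat, f2_hat)
```

On this toy graph the exact values obtained from the recursions of Eq. (4) and Eq. (8) are $\sum_{u \in V \setminus S} (L - h_{uS}^L) = 0.5$ and $F_2(S) = 2$, so the printed estimates should be close to these numbers.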

The time complexity of Algorithm 2 is $O(nRL)$. This is because running an $L$-length random walk takes $O(L)$ time, and for each node, the algorithm needs to run $R$ $L$-length random walks. The space complexity of Algorithm 2 is $O(m + n)$, which is linear w.r.t. the graph size. Based on Algorithm 2, the time complexity of the greedy algorithm is reduced to $O(kn^2RL)$, and the space complexity of the greedy algorithm is linear, which is significantly better than the greedy algorithm with exact marginal gain computation using a dynamic programming (DP) algorithm. Since Algorithm 2 can be applied to compute a good approximation of the marginal gain, the performance guarantee of the greedy algorithm with sampling-based marginal gain computation can be preserved. In effect, by an analysis similar to that presented in [11], such a greedy algorithm can achieve a $1 - 1/e - \epsilon$ approximation factor by setting the parameter $R$ appropriately. In addition, it is worth noting that the sampling-based greedy algorithm can also be accelerated using the lazy evaluation strategy [19].
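To make the lazy evaluation idea concrete, here is a minimal sketch of a lazy greedy loop for a generic submodular objective. It relies on the fact that, under submodularity, a marginal gain computed against an earlier (smaller) set is an upper bound on the current gain, so most candidates need not be re-evaluated in each round. The function name and the `gain(u, S)` callback are hypothetical; with sampled gains the upper-bound property holds only approximately, so this should be read as an illustration rather than the paper's implementation.

```python
import heapq

def lazy_greedy(candidates, gain, k):
    """Select up to k elements; gain(u, S) returns the marginal gain of u given list S."""
    S = []
    # Max-heap through negated gains; the third field records |S| when the gain was computed.
    heap = [(-gain(u, S), u, 0) for u in candidates]
    heapq.heapify(heap)
    while heap and len(S) < k:
        neg_gain, u, stamp = heapq.heappop(heap)
        if stamp == len(S):
            # The gain is up to date and beats every other upper bound: select u.
            S.append(u)
        else:
            # Stale upper bound: recompute against the current S and push back.
            heapq.heappush(heap, (-gain(u, S), u, len(S)))
    return S

# Toy coverage objective: each candidate covers a set of items.
covers = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {1}}

def coverage_gain(u, S):
    covered = set().union(*(covers[s] for s in S)) if S else set()
    return len(covers[u] - covered)

print(lazy_greedy(covers, coverage_gain, k=2))
```

In this toy run only one candidate is re-evaluated in the second round; the remaining stale entries stay in the heap untouched.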

3.2 Approximate greedy algorithm 

Although the sampling-based greedy algorithm is much more efficient than the DP-based greedy algorithm, its time complexity is $O(kn^2RL)$, which implies that it can only scale to medium-size graphs. Here we propose an approximate greedy algorithm for both problem (1) and problem (2) with linear time complexity (w.r.t. graph size) and near-optimal performance guarantee. Recall that in the sampling-based greedy algorithm, we need to invoke the sampling algorithm (Algorithm 2) to estimate the marginal gain $\sigma_u(S)$ for each node $u$. In each round, the greedy algorithm needs to find the node with maximal marginal gain. Note that there are $n - |S|$ candidate nodes in total. Thus, the sampling-based greedy algorithm needs to invoke Algorithm 2 $O(kn)$ times in $k$ rounds, which indicates that the algorithm needs to run $O(kn^2R)$ $L$-length random walks. Can we reduce the sample complexity of the sampling-based greedy algorithm? In this subsection, we give an algorithm that only needs to run $O(nR)$ $L$-length random walks, and it also preserves the $1 - 1/e - \epsilon$ approximation factor. For convenience, we call this algorithm an approximate greedy algorithm. Below, we mainly focus on describing the algorithm for problem (1); similar descriptions can be used for problem (2) (we have added some remarks for problem (2) in Algorithms 3, 4, 5, and 6).
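The gap between the two sample complexities can be made tangible with a short back-of-the-envelope calculation. The sketch below only evaluates the leading terms $kn^2R$ and $nR$ with constants dropped; the function names are hypothetical, and the numbers plugged in (an Epinions-sized graph with $k = 100$ and $R = 100$) are merely an example.

```python
def walks_sampling_greedy(n, k, R):
    # k rounds, up to n candidates per round, and each gain estimate
    # restarts R walks from up to n source nodes (leading term only).
    return k * n * n * R

def walks_approx_greedy(n, R):
    # R walks per node, sampled once and reused in every round.
    return n * R

n, k, R = 75_872, 100, 100
ratio = walks_sampling_greedy(n, k, R) // walks_approx_greedy(n, R)
print(ratio)  # equals k * n
```

The ratio is $kn$ regardless of $R$, so the saving grows with both the graph size and the number of selected nodes.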

The key idea is as follows. First, for each node, the algorithm independently runs $R$ $L$-length random walks. Then, the algorithm materializes these samples (each $L$-length random walk is one sample) and applies them to estimate the marginal gain $\sigma_u(S)$ for any given node $u$ and any given set $S$. The challenge is how to estimate $\sigma_u(S)$ efficiently using such samples, because $S$ changes in each round of the greedy algorithm. To overcome this challenge, we present an inverted list structure to index the samples. Specifically, we build $R$ inverted lists, and each inverted list includes $n$ sublists. For each node $u$, a sublist indexes all the other nodes that hit $u$ through an $L$-length random walk. Each entry of the sublist is an object with two parts: a node ID (id) and a weight (weight), denoting that node id hits $u$ at the weight-th hop. Algorithm 3 depicts the inverted index construction algorithm. In Algorithm 3, the $R$ inverted lists, denoted by $I[1:R][1:n]$, are organized as a two-dimensional list array, in which $I[i][v]$ indexes all the nodes that hit $v$ in their $i$-th $L$-length random walk. First, the algorithm initializes $I[1:R][1:n]$ to empty lists (line 1). Then, for each node $w$ in $V$, the algorithm runs $R$ $L$-length random walks (lines 2-14). Let us consider the $i$-th $L$-length random walk starting at node $w$. If $w$ hits a node $v$, the algorithm creates an object $\langle w, weight \rangle$, where weight denotes that $w$ hits $v$ at the weight-th hop (lines 11-12). Then, the algorithm adds it into $I[i][v]$ (line 13). Note that for the repeated nodes in an $L$-length random walk, we only need to index the node once and record the weight at the first visiting time, according to the definition of hitting time. To skip such repeated nodes in an $L$-length random walk, the algorithm maintains a visited[1:n] array (lines 4, 6 and 9-10).



Algorithm 3 Invert_Index(G, L, R)

Input: A graph G = (V, E), two parameters L and R
Output: An inverted index I[1:R][1:n]

 1: Initialize an inverted list I[1:R][1:n] ← NULL;
 2: for each node w ∈ V do
 3:   for i = 1 : R do
 4:     Initialize visited[1:n] ← 0;
 5:     u ← w;
 6:     visited[u] ← 1;
 7:     for j = 1 : L do
 8:       Randomly select a neighbor of u, denoted by v;
 9:       if visited[v] == 0 then
10:         visited[v] ← 1;
11:         Object.id ← w;
12:         Object.weight ← j; /* w hits v at j-th step */
            /* Object.weight ← 1; for problem (2) */
13:         I[i][v].push_back(Object);
14:       u ← v;
15: return I[1:R][1:n];
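A possible Python rendering of the inverted index construction is sketched below. It follows the steps of Algorithm 3 but makes several illustrative choices that are not in the paper: the graph is an adjacency-list dictionary `adj`, entries are `(w, weight)` tuples, the index is a list of dictionaries rather than a two-dimensional array, a `seed` is passed for reproducibility, and a walk stops early at a node without neighbours.

```python
import random
from collections import defaultdict

def build_inverted_index(adj, L, R, seed=0, unit_weight=False):
    """Return index, where index[i][v] lists (w, weight) pairs: node w first
    reaches v at hop `weight` of its i-th walk. With unit_weight=True every
    weight is 1, mirroring the remark for problem (2)."""
    rng = random.Random(seed)
    index = [defaultdict(list) for _ in range(R)]
    for w in adj:
        for i in range(R):
            visited = {w}          # the start node never indexes itself
            u = w
            for j in range(1, L + 1):
                if not adj[u]:
                    break          # dead end: stop the walk (assumption)
                v = rng.choice(adj[u])
                if v not in visited:   # keep only the first visit of v
                    visited.add(v)
                    index[i][v].append((w, 1 if unit_weight else j))
                u = v
    return index
```

Since each walk contributes at most $L$ entries, the whole index holds at most $nRL$ entries, in line with the space bound discussed for the approximate greedy algorithm.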



Given the inverted lists $I[1:R][1:n]$, how can we estimate the marginal gain for any node $u$ and a given set $S$? Here we tackle this issue by maintaining a two-dimensional array $D[1:R][1:n]$. Given a set $S$, $D[i][u]$ denotes an estimator of the hitting time $h^L_{uS}$ based on the $i$-th $L$-length random walk. Let $S_u = S \cup \{u\}$, and let $\sigma_u(S) = F_1(S_u) - F_1(S)$ be the marginal gain. Then, we can derive that $\sigma_u(S) = \sum_{w \in V \setminus S_u} (h^L_{wS} - h^L_{wS_u}) + h^L_{uS} - L$. Recall that in each round of the greedy algorithm, we need to find the node with maximal marginal gain. Therefore, for each node $u$, we can estimate $\sigma_u$ by $\sum_{w \in V \setminus S_u} (\hat{h}^L_{wS} - \hat{h}^L_{wS_u}) + \hat{h}^L_{uS}$, because the constant "$-L$" does not affect the results. Algorithm 4 describes an algorithm for estimating $\sigma_u$. Let us consider the $i$-th $L$-length random walk. First, $\sigma_u$ is initialized to 0. Then, the algorithm adds $D[i][u]$, which is an estimator of $h^L_{uS}$, to $\sigma_u$ (line 3). Next, the algorithm estimates $\sum_{w \in V \setminus S_u} (h^L_{wS} - h^L_{wS_u})$ and adds it to $\sigma_u$, which is implemented in lines 4-7. By definition, if a node $v$ in $V \setminus S_u$ does not hit $u$, then we have $h^L_{vS} = h^L_{vS_u}$. Thus, the algorithm only needs to consider the nodes that hit $u$ (line 4), which are indexed in $I[i][u]$. If $h^L_{vu} < h^L_{vS}$, then the algorithm adds $h^L_{vS} - h^L_{vu}$ to $\sigma_u$. Otherwise, we have $h^L_{vS} = h^L_{vS_u}$. Note that by definition, $h^L_{vu}$ can be estimated by the weight associated with $v$ in $I[i][u]$, and $h^L_{vS}$ can be estimated by $D[i][v]$; thus $h^L_{vS} - h^L_{vu}$ can be estimated by $D[i][v]$ minus the weight associated with $v$ (line 7). Therefore, lines 3-7 of Algorithm 4 estimate $\sigma_u$ based on the $i$-th $L$-length random walk. Finally, Algorithm 4 takes the average over all the $R$ estimators (line 10).

Algorithm 4 can be used to estimate the marginal gain of every node given a set $S$. In the greedy algorithm, after one round, the size of $S$ increases by 1. Hence, we need to dynamically maintain the array $D[1:R][1:n]$ when $S$ changes. Algorithm 5 depicts an algorithm to update $D[1:R][1:n]$ when an element $u$ is inserted into $S$. As usual, let us consider the $i$-th $L$-length random walk. By definition, for a node $v$, if $h^L_{vu} < h^L_{vS}$, then we need to update $D[i][v]$. Otherwise, we have $h^L_{vS} = h^L_{vS_u}$, so there is no need to update $D[i][v]$. In addition, for a node $v$ that does not hit $u$, we do not need to update $D[i][v]$, as $h^L_{vS} = h^L_{vS_u}$ by definition. In Algorithm 5, the algorithm first sets $D[i][u]$ to 0 (line 2), because $h^L_{uS_u} = 0$ ($u$ is in $S_u$). Then, the algorithm updates $D[i][v]$ for each node $v$ that hits $u$ in the $i$-th $L$-length random walk (lines 3-6).

Algorithm 4 Approx_Gain(I[1:R][1:n], D[1:R][1:n], u, R)

Input: The inverted index I[1:R][1:n], the array D[1:R][1:n], a node u and parameter R
Output: Approximate marginal gain σ_u

 1: Initialize σ_u ← 0;
 2: for i = 1 : R do
 3:   σ_u ← σ_u + D[i][u];
      /* σ_u ← σ_u + 1 − D[i][u]; for problem (2) */
 4:   while Object ← I[i][u].pop() do
 5:     v ← Object.id;
 6:     if Object.weight < D[i][v] then
 7:       σ_u ← σ_u + D[i][v] − Object.weight;
        /* for problem (2), use line 8-9 to replace line 6-7 */
 8:     if Object.weight > D[i][v] then
 9:       σ_u ← σ_u + Object.weight − D[i][v];
10: σ_u ← σ_u / R;
11: return σ_u;

Algorithm 5 Update(I[1:R][1:n], D[1:R][1:n], u, R)

Input: The inverted index I[1:R][1:n], the array D[1:R][1:n], a node u and parameter R
Output: The updated array D[1:R][1:n]

 1: for i = 1 : R do
 2:   D[i][u] ← 0; /* D[i][u] ← 1; for problem (2) */
 3:   while Object ← I[i][u].pop() do
 4:     v ← Object.id;
 5:     if Object.weight < D[i][v] then
 6:       D[i][v] ← Object.weight;
        /* for problem (2), use line 7-8 to replace line 5-6 */
 7:     if Object.weight > D[i][v] then
 8:       D[i][v] ← Object.weight;

Equipped with Algorithm 3, Algorithm 4, and Algorithm 5, we present the approximate greedy algorithm in Algorithm 6. First, Algorithm 6 builds $R$ inverted lists (line 1). Second, the algorithm initializes the answer set $S$ to an empty set (line 2), and sets the value of each entry in $D[1:R][1:n]$ to $L$ (line 3), because $h^L_{uS} = L$ given $S = \emptyset$. Third, the algorithm works in $k$ rounds (lines 4-7). In each round, the algorithm invokes Algorithm 4 to estimate the marginal gain $\sigma_u(S)$, and selects the node $v$ with maximal $\sigma_u(S)$. Then, the algorithm adds $v$ into the answer set $S$. After that, the algorithm invokes Algorithm 5 to update $D[1:R][1:n]$. The following example illustrates how Algorithm 6 works.

Algorithm 6 The approximate greedy algorithm

Input: A graph G = (V, E), and a parameter k
Output: A set of nodes S

 1: I[1:R][1:n] ← Invert_Index(G, L, R);
 2: S ← ∅;
 3: Initialize D[1:R][1:n] ← L;
    /* D[1:R][1:n] ← 0; for problem (2) */
 4: for i = 1 to k do
 5:   v ← arg max_{u ∈ V\S} Approx_Gain(I[1:R][1:n], D[1:R][1:n], u, R);
 6:   S ← S ∪ {v};
 7:   Update(I[1:R][1:n], D[1:R][1:n], v, R);
 8: return S;

Example 3.1: Let us re-consider the example graph shown in Fig. 1. For simplicity, we set $R = 1$, $L = 2$, and $k = 2$. Suppose that the 2-length random walks for each node are as follows: $(v_1, v_2, v_3)$, $(v_2, v_3, v_5)$, $(v_3, v_2, v_5)$, $(v_4, v_7, v_5)$, $(v_5, v_2, v_6)$, $(v_6, v_7, v_5)$, $(v_7, v_5, v_7)$, and $(v_8, v_7, v_4)$. Then, the inverted index constructed by Algorithm 3 ($I[1][1:8]$) is illustrated in Table 1.

Table 1: Inverted index

Node   Inverted list I[1][v]
v1     (empty)
v2     <v1,1>, <v3,1>, <v5,1>
v3     <v1,2>, <v2,1>
v4     <v8,2>
v5     <v2,2>, <v3,2>, <v4,2>, <v6,2>, <v7,1>
v6     <v5,2>
v7     <v4,1>, <v6,1>, <v8,1>
v8     (empty)

Note that in $(v_7, v_5, v_7)$, $v_7$ is a repeated node, thus the second $v_7$ will not be inserted into the inverted list by Algorithm 3. After building the inverted index, Algorithm 6 initializes $S$ to an empty set, and sets all the elements of $D[1][1:8]$ to 2. Then, in the first round, the algorithm invokes Algorithm 4 to estimate the marginal gain $\sigma_u(\emptyset)$ for each node. After this step, we get $\sigma_{v_1}(\emptyset) = 2$, $\sigma_{v_2}(\emptyset) = 5$, $\sigma_{v_3}(\emptyset) = 3$, $\sigma_{v_4}(\emptyset) = 2$, $\sigma_{v_5}(\emptyset) = 3$, $\sigma_{v_6}(\emptyset) = 2$, $\sigma_{v_7}(\emptyset) = 5$, and $\sigma_{v_8}(\emptyset) = 2$. For instance, for node $v_2$, there are three elements in the inverted list $I[1][2]$. Since the weights of $v_1$, $v_3$, and $v_5$ (all of them equal to 1) are smaller than $D[1][1]$, $D[1][3]$, and $D[1][5]$ (all of them equal to 2) respectively, we have $\sigma_{v_2}(\emptyset) = D[1][2] + 3 = 5$ as desired. Similar analysis can be used for the other nodes. Clearly, $v_2$ and $v_7$ achieve the maximal marginal gain. The algorithm breaks ties randomly. Assume that in this round, the algorithm selects $v_2$ and adds it into $S$. Then, the algorithm invokes Algorithm 5 (Update($I[1][1:8]$, $D[1][1:8]$, $v_2$, 1)) to update $D[1][1:8]$. After this step, only $D[1][2]$, $D[1][1]$, $D[1][3]$, and $D[1][5]$ need to be updated, and they are reset to 0, 1, 1, and 1 respectively. Similar arguments can be used for analyzing the second round. Here we only report the result, and omit the details for brevity. In the second round, the algorithm adds $v_7$ into the answer set. Therefore, the algorithm outputs $\{v_2, v_7\}$ as the targeted nodes. □
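The example can be replayed in code. The sketch below implements the problem (1) branch of the gain estimation, the update step, and the greedy loop on an already-built inverted index; the index literal reproduces the Table 1 entries. The function names, the dictionary-based layout, and the deterministic tie-breaking (first node in iteration order, whereas the paper breaks ties randomly) are assumptions made for illustration. Unlike the pseudocode, the lists are iterated rather than popped, so the index survives across rounds.

```python
def approx_gain(index, D, u):
    """Average, over the R walks, of D[i][u] plus the hitting-time savings
    of the nodes that reach u earlier than their current estimate."""
    total = 0.0
    for i, lists in enumerate(index):
        total += D[i][u]
        for w, weight in lists.get(u, []):
            if weight < D[i][w]:
                total += D[i][w] - weight
    return total / len(index)

def update(index, D, u):
    """Insert u into S: u itself now has hitting time 0, and every node
    that reaches u earlier than its current estimate is improved."""
    for i, lists in enumerate(index):
        D[i][u] = 0
        for w, weight in lists.get(u, []):
            if weight < D[i][w]:
                D[i][w] = weight

def approx_greedy(index, nodes, L, k):
    D = [{v: L for v in nodes} for _ in index]  # h = L when S is empty
    S = []
    for _ in range(k):
        best = max((v for v in nodes if v not in S),
                   key=lambda v: approx_gain(index, D, v))
        S.append(best)
        update(index, D, best)
    return S, D

# Inverted index of Table 1 (R = 1); node v_j is written as the integer j.
table1 = [{
    2: [(1, 1), (3, 1), (5, 1)],
    3: [(1, 2), (2, 1)],
    4: [(8, 2)],
    5: [(2, 2), (3, 2), (4, 2), (6, 2), (7, 1)],
    6: [(5, 2)],
    7: [(4, 1), (6, 1), (8, 1)],
}]
nodes = list(range(1, 9))
print(approx_greedy(table1, nodes, L=2, k=2)[0])
```

With the tie between $v_2$ and $v_7$ resolved in favour of $v_2$, the run follows the same path as the example and ends with the set $\{v_2, v_7\}$.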

We analyze the time and space complexity of Algorithm 6 as follows. First, to build the inverted index (line 1), Algorithm 3 takes $O(RLn)$ time. Second, to estimate the marginal gain for every node, the algorithm needs to invoke Algorithm 4 $O(n)$ times. The time complexity of this step (line 5) is $O(nRL)$, because the algorithm only needs to access the entire inverted index once and the size of the inverted index is bounded by $O(nRL)$. Third, to update $D[1:R][1:n]$, Algorithm 5 takes at most $O(Rn)$ time. Putting it all together, the time complexity of Algorithm 6 is $O(kRLn)$, which is linear w.r.t. the graph size ($R$, $k$, and $L$ are small constants). For the space complexity, the algorithm needs to maintain two arrays: the inverted index $I[1:R][1:n]$ and the array $D[1:R][1:n]$. Clearly, $I[1:R][1:n]$ and $D[1:R][1:n]$ are bounded by $O(RLn)$ and $O(Rn)$ respectively. Therefore, the space complexity of Algorithm 6 is $O(nRL + m)$.

Note that in Algorithm 6, each marginal gain is estimated from the same $R$ $L$-length random walks. Since the $L$-length random walks are independent of one another, the estimator is able to achieve high accuracy. As a result, the approximation factor of Algorithm 6 is $1 - 1/e - \epsilon$ when $R$ is set appropriately. In the experiments, we find that the effectiveness of Algorithm 6 is comparable with that of the DP-based greedy algorithm even when $R$ is a small value (e.g., $R = 100$).

4. EXPERIMENTS 

In this section, we conduct extensive experiments over both syn- 
thetic and real-world graphs. We aim at evaluating the effective- 
ness, efficiency and scalability of our algorithms. In the following, 
we first describe the experimental setup and then report the results. 

4.1 Experimental setup 

Table 2: Summary of the datasets

Name         # of nodes   # of edges
CAGrQc            5,242       28,968
CAHepPh          12,008      236,978
Brightkite       58,228      428,156
Epinions         75,872      396,026

Different algorithms: Since the proposed random-walk domination problems are novel, we are not aware of any algorithm in the literature that addresses these problems. Intuitively, high-degree nodes are more easily reached by the other nodes. Therefore, to maximize the expected number of reached nodes, a reasonable baseline algorithm is to select the top-$k$ high-degree nodes as the targeted nodes. For convenience, we refer to this baseline algorithm as the Degree algorithm. The second baseline is the traditional dominating-set-based algorithm [8]. A dominating set is a subset of nodes $D \subseteq V$ such that every node in $V$ is either in $D$ or a neighbor of some node in $D$ [8]. By this definition, every node can only dominate its neighbors. In our problems, since we have a cardinality constraint, i.e., $|S| \le k$, we cannot select the entire dominating set. Instead, we select $k$ nodes such that they dominate as many nodes as possible. Note that here the concept of domination is based on the definition of the traditional dominating set. Specifically, let $S$ be the set of targeted nodes. Initially, $S$ is an empty set. The algorithm works in $k$ rounds. In each round, the algorithm selects a node $v$ such that $v = \arg\max_{u \in V \setminus S} |N(S \cup \{u\}) - N(S)|$, where $N(S)$ denotes the set of immediate neighbors of nodes in $S$. Then, the algorithm adds $v$ into the set $S$, and goes to the next round. We call this algorithm the Dominate algorithm.
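The two baselines are simple enough to sketch in a few lines. The code below is an illustrative reading of the descriptions above, not the authors' implementation: the adjacency-list input, the function names, and the tie-breaking rules (smaller node ID, or first in iteration order) are assumptions. $N(S)$ is taken literally as the union of the neighbour sets of the nodes in $S$.

```python
def degree_baseline(adj, k):
    """Top-k nodes by degree; ties go to the smaller node ID (assumption)."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))[:k]

def dominate_baseline(adj, k):
    """In each round, add the node whose neighbours enlarge N(S) the most."""
    S, covered = [], set()          # covered plays the role of N(S)
    for _ in range(min(k, len(adj))):
        best = max((v for v in adj if v not in S),
                   key=lambda v: len(set(adj[v]) - covered))
        S.append(best)
        covered |= set(adj[best])
    return S

# Two dense hubs (0 and 1) share their neighbourhood; node 4 sits in a separate component.
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1], 3: [0, 1],
       4: [5, 6], 5: [4], 6: [4]}
print(degree_baseline(adj, 2), dominate_baseline(adj, 2))
```

On this toy graph the Degree baseline picks the two overlapping hubs, while the Dominate baseline spends its second pick on the hub of the other component, which illustrates why the two baselines can behave differently.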

We compare two proposed algorithms with the above two baseline algorithms. The first algorithm is the DP-based greedy algorithm, in which the marginal gain is calculated by the DP algorithm. The second algorithm is the approximate greedy algorithm, i.e., Algorithm 6. Both of them are used to solve both problem (1) and problem (2). Here we do not report the results of the sampling-based greedy algorithm because the approximate greedy algorithm is more efficient than it. For convenience, we refer to the first algorithm for solving problem (1) and problem (2) as DPF1 and DPF2 respectively. Similarly, we call the second algorithm for solving problem (1) and problem (2) ApproxF1 and ApproxF2 respectively.

Evaluation metrics: Two metrics are used to evaluate the effectiveness of different algorithms. The first metric is the average hitting time, defined as $M_1(S) = \sum_{u \in V \setminus S} h^L_{uS} / |V \setminus S|$, where $S$ denotes the set of nodes selected by an algorithm. This metric inversely measures the effectiveness of the algorithm. In other words, the smaller $M_1(S)$ is, the more effective the algorithm is. The second metric is the expected number of nodes that hit a node in $S$ via an $L$-length random walk. The formula of the second metric is $M_2(S) = \sum_{u \in V} E[X^L_{uS}]$. A larger $M_2(S)$ implies higher effectiveness of the algorithm. For convenience, we refer to the first metric and the second metric as AHT and EHN respectively. Note that to compute these metrics, we use the sampling algorithm described in Algorithm 2 and set the sample size $R = 500$. To evaluate the efficiency of different algorithms, we record the running time, measured as wall-clock time.

Datasets: We use four real-world datasets in our experiments: CAGrQc, CAHepPh, Brightkite, and Epinions. The CAGrQc and CAHepPh datasets are co-authorship networks which represent co-authorship in two different areas of physics. Brightkite is a location-based social network dataset, where the users can check in at spots and share their location information with their friends. Epinions is a trust social network dataset, where an edge represents the trust relationship between two users. All four datasets are downloaded from the Stanford network data collection [13]. Detailed statistics of the datasets are shown in Table 2.

Figure 2: Comparison of effectiveness of DPF1 and ApproxF1

Figure 3: Comparison of effectiveness of DPF2 and ApproxF2

Experimental environment: We conduct all the experiments on 
a Windows XP PC with 2xQuad-Core Intel Xeon 2.66 GHz CPU, 
and 8GB memory. All the algorithms are implemented in C++. 

4.2 Experimental Results 

Performance of the approximate greedy algorithms: Here we compare the effectiveness and efficiency of the approximate greedy algorithms (ApproxF1 and ApproxF2) with those of the DP-based greedy algorithms (DPF1 and DPF2). Due to the high time and space complexity of the DPF1 and DPF2 algorithms, these two algorithms can only work well on very small datasets. To this end, we generate a small synthetic graph with 1000 nodes and 9956 edges based on a commonly-used power-law random graph model [17]. We set the parameter $k$, which denotes the number of nodes selected by the different algorithms, to 30, and set the parameter $L$ in the $L$-length random walk model to 5 and 10 respectively. Similar results can be observed for other values of $k$ and $L$. The results are shown in Fig. 2 and Fig. 3. Specifically, Fig. 2 depicts the comparison of effectiveness of the DPF1 and ApproxF1 algorithms. The black dashed line in Fig. 2 describes the effectiveness of the DPF1 algorithm, while the red solid curve depicts the effectiveness of the ApproxF1 algorithm as a function of the parameter $R$, the number of samples used to estimate the marginal gain. As can be seen in Fig. 2, the ApproxF1 algorithm is very accurate when the number of samples is greater than or equal to 50. For example, in Fig. 2(a), the greatest difference of AHT between the DPF1 and ApproxF1 algorithms is around 0.01, which is achieved at $R = 50$. Moreover, when $R = 100$, the result of the ApproxF1 algorithm matches the result of the DPF1 algorithm. In Fig. 2(c), we can see that the expected number of nodes that can hit the selected nodes calculated by the ApproxF1 algorithm is very close to the expected number of nodes computed by the DPF1 algorithm. The maximal difference of EHN between the DPF1 and ApproxF1 algorithms is around 1.5, which is achieved at $R = 200$.

Figure 4: Comparison of running time: DP-based greedy algorithms vs approximate greedy algorithms. Panels: (a) L=5, (b) L=10; bars: DPF1, ApproxF1, DPF2, ApproxF2.

Figure 5: Running time as a function of R (surviving panel label: (b) CAHepPh)

Fig. 3 illustrates the comparison of effectiveness of the DPF2 and ApproxF2 algorithms. Similarly, from Fig. 3 we can observe that the effectiveness of the ApproxF2 algorithm is very close to that of the DPF2 algorithm. In Fig. 3(a), for instance, the maximal difference of AHT between the DPF2 and ApproxF2 algorithms is smaller than 0.01 (obtained at $R = 100$). Hence, for both AHT and EHN metrics, the approximate greedy algorithms work very well with a small $R$ value. These results are consistent with the theoretical analysis in Section 3.2.

Now we compare the running time of the approximate greedy algorithms (ApproxF1 and ApproxF2) with that of the DP-based greedy algorithms (DPF1 and DPF2). The results are reported in Fig. 4. From Fig. 4 we can clearly see that the running time of the DPF1 and DPF2 algorithms is significantly longer than that of the ApproxF1 and ApproxF2 algorithms, where the running time of the ApproxF1 and ApproxF2 algorithms is recorded at $R = 250$. For example, in Fig. 4(a), the running time of the DPF1 algorithm is larger than 400 seconds, while the running time of the ApproxF1 algorithm is around 2 seconds. That is to say, the ApproxF1 algorithm is about 200 times faster than the DPF1 algorithm. It is worth mentioning that the running time of DPF1 is twice that of DPF2. This is because the DPF1 algorithm needs an extra "addition operation" for computing the hitting time compared with the DPF2 algorithm. In addition, the running time of the different algorithms when $L = 10$ is twice their running time when $L = 5$.

We also study the running time of the ApproxF1 and ApproxF2 algorithms as a function of the parameter $R$. The results are shown in Fig. 5. As observed, the running time of the ApproxF1 and ApproxF2 algorithms increases linearly as $R$ increases, which is consistent with the time complexity of the approximate greedy algorithms being linear w.r.t. $R$.

Effectiveness of different algorithms: Here we compare the effectiveness of different algorithms over four real-world datasets. As indicated in the previous experiment, under both AHT and EHN metrics, there is no significant difference between the ApproxF1 (ApproxF2) algorithm and the DPF1 (DPF2) algorithm. Furthermore, the former algorithms are more efficient than the latter algorithms by up to two orders of magnitude. Hence, in the following experiments, for the greedy algorithms, we only report the results obtained by the ApproxF1 and ApproxF2 algorithms. For these algorithms, we set the parameter $R$ to 100 in all the following experiments unless otherwise stated, because $R = 100$ is enough to ensure good accuracy as shown in the previous experiment. For all the algorithms, we set the parameter $L$ to 6; similar results can be observed for other $L$ values. Fig. 6 and Fig. 7 describe the results of different algorithms over four real-world datasets under AHT and EHN metrics respectively. From Fig. 6 we can see that both the ApproxF1 and ApproxF2 algorithms are significantly better than the two baselines on all the datasets used. As desired, for all the algorithms, the AHT decreases as $k$ increases. In addition, we can see that the ApproxF1 algorithm slightly outperforms the ApproxF2 algorithm, because the ApproxF1 algorithm directly optimizes the AHT metric. Also, we can observe that the Dominate algorithm is slightly better than the Degree algorithm on the CAHepPh, Brightkite, and Epinions datasets. On the CAGrQc dataset, however, the Degree algorithm performs poorly, and the Dominate algorithm significantly outperforms the Degree algorithm. Similarly, as can be seen in Fig. 7, the ApproxF1 and ApproxF2 algorithms substantially outperform the baselines over all the datasets under the EHN metric. Moreover, we can see that the ApproxF2 algorithm is slightly better than the ApproxF1 algorithm, because the ApproxF2 algorithm directly maximizes the EHN metric. Note that, under both AHT and EHN metrics, the gap between the curves of the approximate greedy algorithms and those of the two baselines increases with increasing $k$. The rationale is that the approximate greedy algorithms are near-optimal, achieving a $1 - 1/e - \epsilon$ approximation factor, and this approximation factor is independent of the parameter $k$. The two baselines, however, come without any performance guarantee, thus the effectiveness of these two algorithms would decrease as $k$ increases. These results are consistent with our theoretical analysis in Section 3.

Figure 6: Comparison of AHT of different algorithms

Efficiency of different algorithms: Here we evaluate the efficiency of different algorithms. Fig. 8 shows the comparison of the running time of different algorithms over the Epinions dataset. Similar results can be obtained on the other datasets. In particular, Fig. 8(a) depicts the running time of different algorithms as a function of the parameter $k$, with the parameter $L$ set to 6. From Fig. 8(a), we observe that the running time of the ApproxF1 and ApproxF2 algorithms is around 2.5 times longer than that of the Degree and Dominate algorithms. Fig. 8(b) illustrates the running time of different algorithms as a function of the parameter $L$, where we set the parameter $k$ to 100. As can be observed in Fig. 8(b), the running time of the ApproxF1 and ApproxF2 algorithms is longer than that of the Degree and Dominate algorithms by 2.7 times at most. For example, when $L = 10$, the running time of ApproxF1 is 99 seconds, while the running time of the Degree algorithm is 37 seconds. These results indicate that the running time of the approximate greedy algorithms is only a small constant factor longer than that of the Degree algorithm, which is consistent with the complexity analysis in Section 3.2.

Figure 7: Comparison of EHN of different algorithms

Figure 8: Comparison of running time of different algorithms on the Epinions dataset

Scalability testing: Here we evaluate the scalability of the ApproxF1 and ApproxF2 algorithms. To this end, we generate ten large synthetic graphs according to a widely-used power-law random graph model [17]. More specifically, we generate ten graphs $G_1, \cdots, G_{10}$ such that $G_i$ has $i \times 0.1$ million nodes and $i$ million edges for $i = 1, \cdots, 10$. Fig. 9 shows the results of the ApproxF1 and ApproxF2 algorithms w.r.t. the number of nodes (left panel) and w.r.t. the number of edges (right panel). Here we set the parameters $L = 6$ and $k = 100$. Similar results can be observed for other values of $L$ and $k$. From Fig. 9 we find that both the ApproxF1 and ApproxF2 algorithms scale linearly w.r.t. both the number of nodes and the number of edges, which is consistent with the linear time complexity (w.r.t. the graph size) of the algorithm.

Effect of parameter L: Here we study the effect of parameter $L$. Fig. 10 reports the results on the CAGrQc and CAHepPh datasets given $k = 60$. Similar results can be observed on other datasets and for other values of $k$ as well. From Fig. 10(a-d), we can see that both the AHT and EHN of different algorithms increase as $L$ increases. Recall that the hitting time is bounded by $L$, and the hitting time of a node that cannot hit the targeted nodes is set to $L$. Therefore, the average hitting time will increase if $L$ increases. Clearly, with $L$ increasing, the number of nodes that can hit the targeted nodes will increase, and thereby the EHN of different algorithms will increase. In addition, we find that the gap between the curves of the ApproxF1 and ApproxF2 algorithms and the curves of the baselines increases as $L$ increases, which suggests that the ApproxF1 and ApproxF2 algorithms perform very well for large $L$ values.

Figure 9: Scalability testing

Figure 10: Effect of parameter L

5. CONCLUSIONS

In this paper, we introduce and formulate two random-walk domination problems in graphs motivated by a number of applications such as item placement in social networks, resource placement in P2P networks, and advertisement placement in advertisement networks. We show that these two problems are instances of the submodular set function maximization problem with a cardinality constraint. Based on this, we propose a dynamic programming (DP) based greedy algorithm with a $1 - 1/e$ approximation factor to solve them. The DP-based greedy algorithm, however, is not very efficient because of the expensive marginal gain evaluation. To further accelerate the greedy algorithm, we present an approximate greedy algorithm with linear time complexity w.r.t. the graph size. We show that the approximate greedy algorithm also has a near-optimal performance guarantee. Extensive experiments are conducted to evaluate the proposed algorithms. The results demonstrate the effectiveness, efficiency, and scalability of the proposed algorithms.

There are a number of future directions that need further investigation. First, since the objective functions of Problem (1) and Problem (2) are submodular, one may combine these two objective functions (e.g., with positive weights, the combination is still submodular) and study the problem of optimizing both the total hitting time and the expected number of nodes that hit the targeted set simultaneously. Second, Problem (2) counts the expected number of nodes that are dominated by the targeted set. It would be interesting to extend this problem to count the expected number of edges that are traversed by the $L$-length random walk starting from any node to the targeted set. Finally, Problem (2) maximizes the expected number of dominated nodes. A complementary problem is, given a parameter $\alpha \in [0, 1]$, to find the minimum number of targeted nodes such that they dominate at least $\alpha n$ nodes in expectation. It would also be interesting to devise efficient algorithms for this problem.

Appendix 

Proof of Theorem l2.ll By definition, we have the following facts. 

If < i < L, we have Pr[T^ = i] = J2 w£V P^ Pr l T ^ 



Fact 1: 

and if i 



L, we have Pr[T^ = i] = J2 wev Pu W Vr[T^ v > t - 1] 
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Fact 2: If < i < L-l, we have Prp^ -1 = i] = Pr[Tj; = i], 
and if i = L - 1, we have Prp^ -1 = i] = Prp™ = i] + 
Pr[l£ = L\. 

Equipped with the above two facts, we can prove the theorem as 
follows. Clearly, if u = v, then T^ v = 0, and thereby h% v = 0. If 
u ^ v, by definition, we have 

h>uv = ^[^uv] = Ei=l ^ P r [^uu = 

= Eg < Pr[ri; =«] + !/ Prp& = L] 
= Ei=i * S^gv P r [^unj = i — 1] 

= Ef=i iTi^evPw P4 T mv = i - 1] 

(ii) 

where the third equation holds due to Fact 1 . Then, we can further 
reduce Eq. dl It as follows. 

ht = Ef=i (» - 1) E^ev P«« Pr P^ = » - 1] 
+ EtiE»€vP»« Pr P'i'«=i-l] 

+ EtuGV P 1 ™ P r Puro — /|2-i 

+(L-l)E«, 6V p«»Pr[2^=L] 

= Ef=i (« - 1) E»gv P"« Pr [ T ^ = * - 1] 
+(i-i)E„ £ vP«» Pr [ :z ^ = I '] + 1 . 

where the equality holds is owing to E/ i=1 Pr[T^„ = i] = 1 and 
Eujg v P"™ = Based on Eq. H2\ and Fact 2, we have 

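$$
\begin{aligned}
h^L_{uv} &= \sum_{i=0}^{L-1} i \sum_{w \in V} p_{uw} \Pr[T^L_{wv} = i] + (L - 1) \sum_{w \in V} p_{uw} \Pr[T^L_{wv} = L] + 1 \\
&= \sum_{i=0}^{L-2} i \sum_{w \in V} p_{uw} \Pr[T^L_{wv} = i] + (L - 1) \sum_{w \in V} p_{uw} \left( \Pr[T^L_{wv} = L - 1] + \Pr[T^L_{wv} = L] \right) + 1 \\
&= \sum_{i=0}^{L-2} i \sum_{w \in V} p_{uw} \Pr[T^{L-1}_{wv} = i] + (L - 1) \sum_{w \in V} p_{uw} \Pr[T^{L-1}_{wv} = L - 1] + 1 \quad \text{\{By Fact 2\}} \\
&= \sum_{i=0}^{L-1} i \sum_{w \in V} p_{uw} \Pr[T^{L-1}_{wv} = i] + 1 = 1 + \sum_{w \in V} p_{uw} h^{L-1}_{wv}.
\end{aligned}
\tag{13}
$$

This completes the proof.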
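
The recursion established above can be checked numerically on a small graph. The following is a minimal sketch, not the authors' implementation: it assumes a simple random walk with $p_{uw} = 1/\deg(u)$, and the toy graph `ADJ` and all function names are hypothetical. It computes $h^L_{uv}$ once by the recursion and once by enumerating every walk of at most $L$ steps, using exact rational arithmetic so the two values can be compared for equality.

```python
from fractions import Fraction

# Hypothetical toy graph (undirected, given as adjacency lists).
ADJ = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}

def transition(adj):
    """Assumed transition probabilities: p[u][w] = 1/deg(u) for each neighbor w."""
    return {u: {w: Fraction(1, len(nbrs)) for w in nbrs} for u, nbrs in adj.items()}

def hitting_time_dp(p, v, L):
    """Return h with h[u] = h^L_{uv}, using
    h^l_{uv} = 0 if u == v, else 1 + sum_w p_uw * h^{l-1}_{wv}."""
    h = {u: Fraction(0) for u in p}  # h^0_{uv} = 0 for every u
    for _ in range(L):
        h = {u: Fraction(0) if u == v
             else 1 + sum(puw * h[w] for w, puw in p[u].items())
             for u in p}
    return h

def hitting_time_enum(p, u, v, L):
    """E[min(first time the walk is at v, L)] by enumerating all walks from u."""
    total = Fraction(0)

    def walk(node, step, prob):
        nonlocal total
        if node == v or step == L:   # the walk hit v, or ran out of hops
            total += prob * step
            return
        for w, puw in p[node].items():
            walk(w, step + 1, prob * puw)

    walk(u, 0, Fraction(1))
    return total
```

The enumeration is exponential in $L$ and is only meant as an independent cross-check; the recursion costs $O(mL)$ per target node.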

Proof of Theorem 3.1. First, it is easy to check that $F_1(\emptyset) = 0$.
Second, we prove that $F_1(S)$ is a non-decreasing set function. Let
$S \subseteq T \subseteq V$ be two subsets of $V$. Then, for any node $u \in V \setminus T$,
we claim that

$$
h^L_{uT} \le h^L_{uS}. \tag{14}
$$

We shall prove the above inequality by induction on $L$. By definition,
we have $h^0_{uT} = h^0_{uS} = 0$ and $h^1_{uT} = h^1_{uS} = 1$. Therefore, the
inequality in Eq. (14) holds if $L = 0$ and $L = 1$. Suppose
that $h^a_{uT} \le h^a_{uS}$ holds given $L = a \ge 1$. Below, we show that the
inequality still holds if $L = a + 1$. By the recursive expression of the generalized hitting time to a set, we have

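$$
\begin{aligned}
h^{a+1}_{uS} &= 1 + \sum_{w \in V \setminus S} p_{uw} h^a_{wS} \\
&= 1 + \sum_{w \in V \setminus T} p_{uw} h^a_{wS} + \sum_{w \in T \setminus S} p_{uw} h^a_{wS} \\
&\ge 1 + \sum_{w \in V \setminus T} p_{uw} h^a_{wS} \ge 1 + \sum_{w \in V \setminus T} p_{uw} h^a_{wT} = h^{a+1}_{uT},
\end{aligned}
$$

where the last inequality holds due to the induction assumption.
Based on Eq. (14), we have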

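$$
\begin{aligned}
F_1(S) - F_1(T) &= \sum_{u \in V \setminus T} h^L_{uT} - \sum_{u \in V \setminus S} h^L_{uS} \\
&\le \sum_{u \in V \setminus T} \left( h^L_{uT} - h^L_{uS} \right) \le 0.
\end{aligned}
$$

Thus, $F_1(S)$ is a non-decreasing set function as desired. Finally,
we prove the submodularity property of $F_1(S)$. Let $T_u = T \cup \{u\}$
and $S_u = S \cup \{u\}$. Let $\sigma_u(S) = F_1(S_u) - F_1(S)$ be the marginal
gain. Then, we have

$$
\sigma_u(S) = \sum_{w \in V \setminus S} h^L_{wS} - \sum_{w \in V \setminus S_u} h^L_{wS_u}
$$

and

$$
\sigma_u(T) = \sum_{w \in V \setminus T} h^L_{wT} - \sum_{w \in V \setminus T_u} h^L_{wT_u}.
$$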

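To prove the submodularity of $F_1(S)$, we show $\sigma_u(T) \le \sigma_u(S)$
as follows:

$$
\begin{aligned}
\sigma_u(S) - \sigma_u(T)
&= \Big( \sum_{w \in V \setminus S} h^L_{wS} - \sum_{w \in V \setminus T} h^L_{wT} \Big)
 - \Big( \sum_{w \in V \setminus S_u} h^L_{wS_u} - \sum_{w \in V \setminus T_u} h^L_{wT_u} \Big) \\
&= \sum_{w \in T \setminus S} \left( h^L_{wS} - h^L_{wT} \right) - \sum_{w \in T \setminus S} \left( h^L_{wS_u} - h^L_{wT_u} \right) \\
&= \sum_{w \in T \setminus S} \left( h^L_{wS} - h^L_{wS_u} \right) \ge 0.
\end{aligned}
\tag{15}
$$

The third equality of the above equation holds since $\sum_{w \in T \setminus S} h^L_{wT} = 0$ and $\sum_{w \in T \setminus S} h^L_{wT_u} = 0$, because the generalized hitting time from a node inside the targeted set is zero.
To prove the last
inequality of Eq. (15), we can use an induction argument similar to the one
applied to prove Eq. (14). We omit the details for brevity.
Putting it all together, we conclude that $F_1(S)$ is a non-decreasing
submodular set function with $F_1(\emptyset) = 0$. Therefore, the theorem is
established.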
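
As a sanity check of the statement just proved, the two properties can be verified exhaustively on a toy graph. The sketch below rests on assumptions that are not spelled out in this appendix: it takes $F_1(S) = nL - \sum_{u \in V \setminus S} h^L_{uS}$ (the form implied by $F_1(\emptyset) = 0$ and the difference formula in the proof), uses $p_{uw} = 1/\deg(u)$, and the graph `ADJ` and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy graph (undirected, given as adjacency lists).
ADJ = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}

def hit_set(adj, S, L):
    """h[u] = h^L_{uS}: 0 if u is in S, else 1 + sum_w p_uw * h^{L-1}_{wS}."""
    h = {u: Fraction(0) for u in adj}  # h^0_{uS} = 0
    for _ in range(L):
        h = {u: Fraction(0) if u in S
             else 1 + sum(h[w] for w in adj[u]) / len(adj[u])
             for u in adj}
    return h

def f1(adj, S, L):
    """Assumed objective: F1(S) = n*L - sum of h^L_{uS} over u outside S."""
    h = hit_set(adj, S, L)
    return len(adj) * L - sum(h[u] for u in adj if u not in S)

def check_f1(adj, L):
    """True if F1 is non-decreasing and submodular over all S <= T and u not in T."""
    nodes = list(adj)
    sets = [frozenset(c) for r in range(len(nodes) + 1)
            for c in combinations(nodes, r)]
    val = {S: f1(adj, S, L) for S in sets}
    for S in sets:
        for T in sets:
            if not S <= T:
                continue
            if val[S] > val[T]:                      # monotonicity violated
                return False
            for u in nodes:
                if u in T:
                    continue
                gain_S = val[S | {u}] - val[S]
                gain_T = val[T | {u}] - val[T]
                if gain_T > gain_S:                  # diminishing returns violated
                    return False
    return True
```

The check enumerates all pairs of subsets, so it is only practical for a handful of nodes; it illustrates the theorem on one instance and is no substitute for the proof.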

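Proof of Theorem 3.2. First, by definition, $X^L_{wS}$ equals zero
if $S = \emptyset$, which results in $F_2(\emptyset) = 0$. Second, we show the
non-decreasing property of $F_2(S)$. Let $S \subseteq T \subseteq V$ be two
subsets of $V$. By the linearity of expectation, we have $F_2(S) =
\sum_{w \in V} \mathbb{E}[X^L_{wS}] = \sum_{w \in V} p^L_{wS}$. Let $p^L_{wv}$ be the probability
that $w$ hits $v$ by an $L$-length random walk. Then, we have $p^L_{wS} =
1 - \prod_{v \in S} (1 - p^L_{wv})$. Further, we have

$$
\begin{aligned}
F_2(S) - F_2(T) &= \sum_{w \in V} \left( p^L_{wS} - p^L_{wT} \right) \\
&= \sum_{w \in V} \Big( \big( 1 - \prod_{v \in S} (1 - p^L_{wv}) \big) - \big( 1 - \prod_{v \in T} (1 - p^L_{wv}) \big) \Big) \\
&= \sum_{w \in V} \Big( \prod_{v \in T} (1 - p^L_{wv}) - \prod_{v \in S} (1 - p^L_{wv}) \Big) \le 0.
\end{aligned}
$$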

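Therefore, $F_2(S)$ is a non-decreasing set function. Finally, we
prove that $F_2(S)$ is a submodular set function. Let $u \in V \setminus T$,
$S_u = S \cup \{u\}$, and $T_u = T \cup \{u\}$. Further, we let $\rho_u(S) =
F_2(S \cup \{u\}) - F_2(S)$ be the marginal gain. Then, we have

$$
\rho_u(S) = \sum_{w \in V} \Big( \prod_{v \in S} (1 - p^L_{wv}) - \prod_{v \in S_u} (1 - p^L_{wv}) \Big)
$$

and

$$
\rho_u(T) = \sum_{w \in V} \Big( \prod_{v \in T} (1 - p^L_{wv}) - \prod_{v \in T_u} (1 - p^L_{wv}) \Big).
$$

In the following, we show that $\rho_u(S) \ge \rho_u(T)$. Specifically, we
have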

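$$
\begin{aligned}
\rho_u(S) - \rho_u(T)
&= \sum_{w \in V} \Big( \big( \prod_{v \in S} (1 - p^L_{wv}) - \prod_{v \in T} (1 - p^L_{wv}) \big)
 - \big( \prod_{v \in S_u} (1 - p^L_{wv}) - \prod_{v \in T_u} (1 - p^L_{wv}) \big) \Big) \\
&= \sum_{w \in V} \Big( \big( 1 - \prod_{v \in T \setminus S} (1 - p^L_{wv}) \big) \prod_{v \in S} (1 - p^L_{wv})
 - \big( 1 - \prod_{v \in T \setminus S} (1 - p^L_{wv}) \big) \prod_{v \in S_u} (1 - p^L_{wv}) \Big) \\
&= \sum_{w \in V} \big( 1 - \prod_{v \in T \setminus S} (1 - p^L_{wv}) \big) \, p^L_{wu} \prod_{v \in S} (1 - p^L_{wv}) \\
&\ge 0.
\end{aligned}
$$

This completes the proof.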
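
The argument above can likewise be exercised on a small instance. The following sketch takes the product form $p^L_{wS} = 1 - \prod_{v \in S}(1 - p^L_{wv})$ exactly as stated in the proof and evaluates $F_2(S) = \sum_{w \in V} p^L_{wS}$ from it. The single-target probabilities $p^L_{wv}$ are computed by a simple recursion that assumes $p_{uw} = 1/\deg(u)$; the graph `ADJ` and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy graph (undirected, given as adjacency lists).
ADJ = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}

def hit_prob(adj, v, L):
    """q[w] = probability that an L-length walk from w visits v (1 if w == v)."""
    q = {w: Fraction(int(w == v)) for w in adj}
    for _ in range(L):
        q = {w: Fraction(1) if w == v
             else sum(q[x] for x in adj[w]) / len(adj[w])
             for w in adj}
    return q

def f2(adj, S, L):
    """F2(S) = sum_w (1 - prod_{v in S} (1 - p^L_{wv})), the form used in the proof."""
    p = {v: hit_prob(adj, v, L) for v in S}
    total = Fraction(0)
    for w in adj:
        miss = Fraction(1)               # probability that w hits no node of S
        for v in S:
            miss *= 1 - p[v][w]
        total += 1 - miss
    return total

def check_f2(adj, L):
    """True if F2 is non-decreasing and submodular over all S <= T and u not in T."""
    nodes = list(adj)
    sets = [frozenset(c) for r in range(len(nodes) + 1)
            for c in combinations(nodes, r)]
    val = {S: f2(adj, S, L) for S in sets}
    for S in sets:
        for T in sets:
            if not S <= T:
                continue
            if val[S] > val[T]:
                return False
            for u in nodes:
                if u not in T and val[T | {u}] - val[T] > val[S | {u}] - val[S]:
                    return False
    return True
```

Because every factor $1 - p^L_{wv}$ lies in $[0, 1]$, the check is expected to succeed for any graph; the exhaustive loop simply makes the two inequalities of the proof concrete.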